Workshop on Massive Geometric Data Sets 2005

Authors

  • Lars Arge
  • Mark de Berg
  • Jan Vahrenhold
Abstract

Heavy hitters, which are items occurring with frequency above a given threshold, are an important aggregation and summary tool when processing data streams or data warehouses. Hierarchical heavy hitters (HHHs) have been introduced as a natural generalization for hierarchical and multi-dimensional data domains. An important application, for instance, involves inferring patterns in the stream of IP packets in the Internet, for the purpose of traffic engineering or network security. Each IP packet is mapped to a multi-dimensional point using its header fields, and the goal is to extract classification rules that account for a large fraction of the traffic. In geometric terms, the problem involves identifying rectangular boxes that contain a significant fraction of a stream of points, excluding those points counted in a smaller heavy hitter box.

Formally, consider a stream S of d-dimensional points and a family B of axis-aligned d-dimensional boxes. Both |S| and |B| are large: the points of the stream arrive online and are too numerous to be stored in memory, and the boxes are likewise too numerous to be maintained explicitly, being defined implicitly by certain rules. The boxes in B are partially ordered by the containment relation: for two boxes B and B′, we write B′ ≺ B if B′ ⊂ B. We define the frequency (or population) of a box B to be the number of points in the stream that lie in B, namely |B ∩ S|; we use the notation S to also denote the part of the stream seen so far. We are interested in identifying those boxes with frequency greater than φ|S|, for a given parameter 0 < φ < 1. In order to avoid redundancy, however, hierarchical heavy hitters are defined using the discounted frequency of a box. (Otherwise, all boxes containing a heavy box may be flagged as heavy even if they do not contain many additional points.)
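The frequency definition above can be illustrated with a minimal brute-force sketch for a small, explicit box family. This is only an illustration of the definitions |B ∩ S| and the φ|S| threshold; the paper's family B is implicit and huge, and the function names here are illustrative assumptions, not from the paper. The second assertion below also shows the redundancy the abstract warns about: every box containing a heavy box is itself flagged heavy under the undiscounted definition.

```python
# Illustrative sketch (not from the paper): frequency of an axis-aligned
# box and undiscounted phi-heavy boxes, for a small explicit box family.
# A box is a pair (lo, hi) of d-dimensional corner tuples; a point lies
# in the box if lo[i] <= p[i] < hi[i] in every coordinate i.

def frequency(box, stream):
    """|B ∩ S|: number of stream points lying inside the box."""
    lo, hi = box
    return sum(all(l <= x < h for x, l, h in zip(p, lo, hi)) for p in stream)

def heavy_boxes(boxes, stream, phi):
    """Boxes whose (undiscounted) frequency exceeds phi * |S|."""
    return [b for b in boxes if frequency(b, stream) > phi * len(stream)]
```

With points {(0,0), (0,1), (3,3)} and φ = 0.5, the nested boxes [0,2)² and [0,4)² are both flagged, even though the outer one adds only a single extra point; the discounted frequency defined next removes exactly this redundancy.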
Discounted frequencies and φ-hierarchical heavy hitters (φ-HHHs) are defined recursively: the discounted frequency of B counts only those points that lie in B but not in another φ-HHH B′ where B′ ≺ B. A box B is a φ-HHH if its discounted frequency exceeds φ|S|. Hierarchical heavy hitters are natural and powerful constructs, but reliably estimating the discounted frequency of boxes has proved elusive. None of the known space-efficient data stream algorithms offer a worst-case guarantee on the approximation quality of the boxes they flag as φ-HHH.

In this talk, we formalize the difficulty of computing true hierarchical heavy hitters and prove lower bounds on the space complexity of algorithms that compute them. For streams of 1-dimensional data, we give an Ω(1/φ) space lower bound for any algorithm, using an information-theoretic argument. To prove lower bounds for streams of multi-dimensional data and to establish stronger space bounds, we limit our discussion to a simple model of deterministic algorithms, which we call the box frequency model. In this model, an algorithm with space bound s is allowed s distinct counters, and each counter maintains the frequency of a box. We show that any single-pass deterministic scheme that computes φ-HHHs for d-dimensional data in the box frequency model with any bounded approximation guarantee must use Ω(1/φ) space. This bound is asymptotically tight, as we can show a deterministic data stream algorithm (in the box frequency model) that computes φ-HHHs with constant approximation error, using O(1/φ) memory.

∗ Mentor Graphics Corp., 8005 SW Boeckman Road, Wilsonville, OR 97070, USA, [email protected]; and (by courtesy) Department of Computer Science, University of California at Santa Barbara.
† Department of Computer Science, University of California at Santa Barbara, Santa Barbara, CA 93106, USA, {nisheeth, suri}@cs.ucsb.edu.
‡ Department of Mathematics, Room 2-336, MIT, Cambridge, MA 02139, USA, [email protected].
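The recursive definition of discounted frequency can be made concrete with an exact offline reference sketch. This is emphatically not the streaming algorithm from the talk (which must work in O(1/φ) space); it stores the whole stream and is only meant to pin down the definition. The dyadic 1-D hierarchy and all names are illustrative assumptions. Processing boxes from smallest to largest ensures that every nested φ-HHH is known before its ancestors are examined.

```python
# Exact offline reference for 1-D phi-HHHs over a dyadic interval
# hierarchy (an illustrative assumption, not the paper's streaming
# algorithm). Boxes are half-open intervals [lo, hi).

def dyadic_boxes(universe_bits):
    """All dyadic intervals over [0, 2**universe_bits), smallest first."""
    n = 1 << universe_bits
    boxes = []
    for level in range(universe_bits + 1):    # level 0 = unit intervals
        width = 1 << level
        for lo in range(0, n, width):
            boxes.append((lo, lo + width))
    return boxes                               # children precede parents

def phi_hhh(stream, phi, universe_bits):
    """phi-HHHs by the recursive discounted-frequency definition."""
    threshold = phi * len(stream)
    hhhs = []
    for lo, hi in dyadic_boxes(universe_bits):
        # HHHs already found that are strictly nested inside [lo, hi)
        covered = [(l, h) for (l, h) in hhhs
                   if lo <= l and h <= hi and (h - l) < (hi - lo)]
        # discounted frequency: points in the box not already counted
        # by a nested phi-HHH
        disc = sum(1 for p in stream
                   if lo <= p < hi
                   and not any(l <= p < h for (l, h) in covered))
        if disc > threshold:
            hhhs.append((lo, hi))
    return hhhs
```

For example, with the stream [0, 0, 0, 0, 5] and φ = 0.5 over [0, 8), only the leaf [0, 1) is a φ-HHH: every ancestor of [0, 1) has discounted frequency at most 1, since the four points at 0 are already charged to the nested HHH.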


Similar resources

MMDS 2014: Workshop on Algorithms for Modern Massive Data Sets

The 2014 Workshop on Algorithms for Modern Massive Data Sets (MMDS 2014) will address algorithmic and statistical challenges in modern large-scale data analysis. The goals of MMDS 2014 are to explore novel techniques for modeling and analyzing massive, high-dimensional, and nonlinearly-structured scientific and internet data sets; and to bring together computer scientists, statisticians, mathem...

Foundations of Data Mining via Granular and Rough Computing

This workshop introduces the foundations of data mining in the context of granular and rough computing. Unlike conventional data mining research, this workshop focuses on deep reflection into the meaning of patterns obtained from data. Since patterns are embedded in sets of real world objects, algebraic or geometric views on sets may play important roles. For this purpose, granular and rough co...

Computational Geometry – 99102

Geometric computing is creeping into virtually every corner of science and engineering, from design and manufacturing to astrophysics and cartography. This report describes presentations made at a workshop focused on recent advances in this computational geometry field. Previous Dagstuhl workshops on computational geometry dealt mostly with theoretical issues: the development of provably effici...

MMDS 2008: Algorithmic and Statistical Challenges in Modern Large-Scale Data Analysis, Part I

The 2008 Workshop on Algorithms for Modern Massive Data Sets (MMDS 2008), held at Stanford University, June 25–28, had two goals: first, to explore novel techniques for modeling and analyzing massive, high-dimensional, and nonlinearly structured scientific and Internet data sets, and second, to bring together computer scientists, statisticians, mathematicians, and data analysis practitioners to...

Computation in Large-Scale Scientific and Internet

A report is provided for the ACM SIGKDD community about the 2010 Workshop on Algorithms for Modern Massive Data Sets (MMDS 2010), its origin in MMDS 2006 and MMDS 2008, and future directions for this interdisciplinary research area.

MMDS 2008: Algorithmic and Statistical Challenges in Modern Large-Scale Data Analysis are the Focus

The 2008 Workshop on Algorithms for Modern Massive Data Sets (MMDS 2008), sponsored by the NSF, DARPA, LinkedIn, and Yahoo!, was held last year at Stanford University, June 25–28, 2008. The goals of MMDS 2008 were (1) to explore novel techniques for modeling and analyzing massive, high-dimensional, and nonlinearly-structured scientific and internet data sets; and (2) to bring together computer ...


Publication date: 2005